Importing packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
df = pd.read_csv("kc_house_data.csv").set_index('id')
df['date'] = df['date'].apply(lambda x: int(x.split('T')[0]))
df = df.astype('float')  # astype returns a copy, so reassign
norm_df = (df - df.mean()) / df.std()
label = norm_df.pop('price')
train, test, labels_train, labels_test = train_test_split(norm_df, label, train_size=0.80)
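A note on the normalization step above: `(df - df.mean()) / df.std()` standardizes every column, including price, so the MSE and MAE reported later are in units of the price's standard deviation, not dollars. A minimal illustration on toy values (not the house data):

```python
import pandas as pd

toy = pd.DataFrame({"price": [100.0, 200.0, 300.0],
                    "sqft": [1.0, 2.0, 6.0]})
norm_toy = (toy - toy.mean()) / toy.std()

# After z-scoring, every column has mean 0 and standard deviation 1.
print(norm_toy.mean().abs().round(6).tolist())  # [0.0, 0.0]
print(norm_toy.std().round(6).tolist())         # [1.0, 1.0]
```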
train.columns
def evaluate_model(model):
    """Fit a regression model on the training split and report test-set errors."""
    model.fit(train, labels_train)
    pred = model.predict(test)
    print(f"MSE: {mean_squared_error(labels_test, pred)}, MAE: {mean_absolute_error(labels_test, pred)}")
    return model, pred

model_rf, pred_rf = evaluate_model(RandomForestRegressor(n_estimators=100, random_state=0))
model_lin, pred_lin = evaluate_model(LinearRegression())
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(train.values,  # LIME expects a numpy array
                                                   mode='regression',
                                                   feature_names=list(train.columns),
                                                   discretize_continuous=False)
def exp_rf_inst(i):
    """Explain the random forest's prediction for training row i."""
    exp = explainer.explain_instance(train.iloc[i].values, model_rf.predict)
    exp.show_in_notebook(show_table=True, show_all=False)
exp_rf_inst(10)
For the 10th observation, the most positive contributions come from sqft_living, lat and grade, at 0.27, 0.25 and 0.2 respectively. The first negative variable is long, at -0.09. In other words, sqft_living and lat carry nearly the same weight.
exp_rf_inst(100)
For the 100th observation, the most positive contributions again come from sqft_living, lat and grade, at 0.27, 0.26 and 0.19 respectively. The first negative variable is long, at -0.1. Once more, sqft_living and lat carry nearly the same weight, and the whole explanation is very close to the one for the 10th observation.
exp_rf_inst(1000)
For the 1000th observation, the most positive contributions are sqft_living, lat and grade, at 0.28, 0.26 and 0.19 respectively. The first negative variable is long, at -0.11. The weights of sqft_living and lat are again nearly equal, and match those of the 100th observation.
Comparing the three observations, the explanations are stable: the four most important variables are nearly the same in all three, and long is always the first negative variable, with a similarly stable weight.
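The eyeball comparison above can also be automated. LIME exposes each explanation as a list of (feature, weight) pairs via `exp.as_list()`; a small helper (hypothetical name `top_features`) can then check whether the top-ranked features agree across observations. The weight lists below are the values reported above, hard-coded so the sketch runs standalone:

```python
def top_features(weights, k=3):
    """Return the names of the k features with the largest absolute weight."""
    ranked = sorted(weights, key=lambda fw: abs(fw[1]), reverse=True)
    return [feature for feature, _ in ranked[:k]]

# Weights reported by exp_rf_inst for observations 10, 100 and 1000.
obs_10   = [("sqft_living", 0.27), ("lat", 0.25), ("grade", 0.20), ("long", -0.09)]
obs_100  = [("sqft_living", 0.27), ("lat", 0.26), ("grade", 0.19), ("long", -0.10)]
obs_1000 = [("sqft_living", 0.28), ("lat", 0.26), ("grade", 0.19), ("long", -0.11)]

# The ranking is identical across all three explanations.
assert top_features(obs_10) == top_features(obs_100) == top_features(obs_1000)
print(top_features(obs_10))  # ['sqft_living', 'lat', 'grade']
```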
def exp_lin_inst(i):
    """Explain the linear model's prediction for training row i."""
    exp = explainer.explain_instance(train.iloc[i].values, model_lin.predict)
    exp.show_in_notebook(show_table=True, show_all=False)
exp_lin_inst(10)
The five most important variables are grade, lat, sqft_living, sqft_above and yr_built, with weights 0.3, 0.23, 0.21 and -0.21. For comparison, let's recall the explanation the random forest model gave for the same observation.
exp_rf_inst(10)
The LIME decompositions of the two models differ substantially for the 10th observation. The biggest difference is sqft_above: it has a negligible negative weight in the random forest, but a large one in the linear model.
The LIME explanations were very stable across observations within a single model, but very different between the two models.
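One reason the linear model's explanations are so internally consistent: with `discretize_continuous=False`, LIME fits a local linear surrogate, and when the underlying model is itself linear, that surrogate recovers approximately the same coefficients at every observation. A quick sanity check, sketched on synthetic data rather than the house dataset (the coefficient values below are illustrative assumptions), is to inspect `model.coef_` directly, since for a linear model the global coefficients are the explanation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a standardized feature matrix (3 features).
X = rng.standard_normal((500, 3))
true_coefs = np.array([0.3, 0.23, -0.21])  # illustrative weights
y = X @ true_coefs + 0.01 * rng.standard_normal(500)

lin = LinearRegression().fit(X, y)

# Every prediction decomposes as sum(coef_[j] * x[j]) + intercept,
# so the same weights apply to all observations.
print(np.round(lin.coef_, 2))  # coefficients close to [0.3, 0.23, -0.21]
```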